The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. 2016, with additional experiments and discussion.